Dispensed prescription medications and short-term risk of pulmonary embolism in Norway and Sweden

Scandinavian electronic health-care registers provide a unique setting to investigate potential unidentified side effects of drugs. We analysed the association between prescription drugs dispensed in Norway and Sweden and the short-term risk of developing pulmonary embolism. A total of 12,104 pulmonary embolism cases were identified from patient- and cause-of-death registries in Norway (2004–2014) and 36,088 in Sweden (2005–2014). A case-crossover design was used to compare individual drugs dispensed 1–30 days before the date of pulmonary embolism diagnosis with dispensation in a 61–90 day time-window, while controlling for the receipt of other drugs. A BOLASSO approach was used to select drugs that were associated with short-term risk of pulmonary embolism. Thirty-eight drugs were associated with pulmonary embolism in the combined analysis of the Norwegian and Swedish data. Drugs associated with increased risk of pulmonary embolism included certain proton-pump inhibitors, antibiotics, antithrombotics, vasodilators, furosemide, anti-varicose medications, corticosteroids, immunostimulants (pegfilgrastim), opioids, analgesics, anxiolytics, antidepressants, antiprotozoals, and drugs for cough and colds. Mineral supplements, hydrochlorothiazide and potassium-sparing agents, beta-blockers, angiotensin 2 receptor blockers, statins, and methotrexate were associated with lower risk. Most associations persisted, and several additional drugs were associated, with pulmonary embolism when using a longer time window of 90 days instead of 30 days. These results provide exploratory, pharmacopeia-wide evidence of medications that may increase or decrease the risk of pulmonary embolism. Some of these findings were expected based on the drugs' indications, while others are novel and require further study as potentially modifiable precipitants of pulmonary embolism.


Appendix A
The case-crossover design A. 1 The likelihood function under the case-crossover design In our analysis, the subjects are the patients who experienced the event (case), and each patient is both the case and the control of itself, in different time periods.For the whole analysis, we used one case and one control window for each patient.According to [6], our design is called a non-localizable design and, more specifically, an unidirectional design, because the windows are placed after observing the event for each patient and not a priori.
Assume that we have a total of N individuals who experienced the event at some point in the time set {1, ..., T }.Here, T is the length of the observation period, and the t ∈ {1, ..., T } are the discrete time points where each case was checked (followed up).Let Y it be the risk occurrence indicator for subject i at time t, that is Y it = 1 if the event took place at time t and Y it = 0 otherwise.For i = 1, . . ., N and t = 1, . . ., T , let X it = (X it1 , ..., X itp ) be the p-dimensional vector of exposures at time t for the ith subject.Let further X i = (X it ; t = 1, ..., T ) be the corresponding p-dimensional exposure time series vector.
Then, if it is assumed that events follow a proportional hazards model, and if it is assumed that events are rare, we will assume that where β = (β 1 , ...β p ) is a vector of coefficients.The baseline function λ 0it is assumed to vary among subjects, carrying characteristics of the ith subject like, for example, age, or other characteristics which may change with time t.
In other words, Eq. ( 1) is the probability that the ith subject will experience the event exactly at time t, given its exposure vector and characteristics.
Assume first that the exposure of each subject is observed throughout the time set {1, 2, ..., T }.Then by conditioning on the occurrence of exactly one event in {1, 2, ..., T }, and as above assuming that events are rare, the contribution to the likelihood from the ith subject would be ( [8,6]): The complete likelihood is then the product of these contributions, namely λ 0is e X is β . ( Since we do not know the exposure X it for all t ∈ 1, 2, ..., T , but only in a referent window W i , following Lumley and Levy [8], we replace the denominators in Eq. (3) by the simplification where s runs only in the referent window.Thus, we replace Eq. ( 3) by In our application, W i = {t i , s i }, where t i and s i are the case and the control windows of the ith subject, respectively.Furthermore, the λ 0it i and λ 0it are cancelled because they are assumed to be (approximately) constant in the referent windows.Now, X it i is a vector of length p corresponding to the exposure of each drug in the case window, while X is i is a vector of the same length corresponding to the exposures in the control window.Both vectors contain the values 0 and 1 for non-exposure and exposure, respectively.
It should be noted that Eq. ( 4) is not an actual likelihood because of the reduction of the sum in the denominator of (3).Thus, as noted by Lumley and Levy [8], the resulting estimates may be biased.This bias, called overlap bias, will however usually be small (see [8]).
On the other hand, Eq. ( 4) has the form of the likelihood of a conditional logistic regression (CLR) and can hence be analyzed using standard statistical packages allowing CLR, Note that the window W i in Eq. ( 4) consists of two time points, t i (case) and s i (control), while more controls can be included by adding points in the window.

A.2 Generalized linear model
Our model can be represented differently in order to show that it is a generalized linear model of logistic type.Consider the matrices X t = (X 1t 1 , ..., X N t N ) T , corresponding to the case windows, and X s = (X 1s 1 , ..., X N s N ) T , corresponding to the control windows, both of dimension N ×p, where X it i and X is i are defined above.Consider also the corresponding response vectors T , for the cases and controls, respectively.These are both N -dimensional, where the response vector for the cases contains only the element 1 and the response vector for the controls contains only the element 0. This is because we restrict attention to patients with exactly one event in the given time set.The likelihood L(β) can hence be written in the form where X i = X it i − X is i , that is, we subtract the control vector from the case vector for each patient.As will be seen below, for a case-crossover design with one case and one control window, the likelihood can be written in the form of the unconditional likelihood for a binary logistic regression with no intercept and constant response equal to one (see Avalos et al. [3]).The fact that the response is constant and equal to one, becomes more clear when we subtract the response vector Y s from Y t .Note that now the data matrix X consists of values −1, 0, 1.Furthermore, for a specific component β j of the vector β, e β j is the relative risk for the event, if the jth component of X i is increased by 1, which means that the corresponding medicine is taken in the case window and not in the control window.
By letting p i be the contribution of the ith individual to the likelihood in Eq. ( 1+e X i β , then the likelihood can be written in the following form: since y i = 1 for all i.The log-likelihood is of the form which corresponds to the likelihood of logistic regression ([9, 10]) Finally, the log-likelihood function given in Eq. ( 7) is a concave function with respect to the β coefficients.This can easily be seen by rewriting the function in the following form, and noting that − log(1 + e x ) is a concave function.

A.3 The lasso method and glmnet
In this section we describe how we estimated the coefficients in β using the lasso (the least absolute shrinkage and selection operator) method ( [11]).More specifically, we estimated the coefficients under the lasso penalty by minimizing the negative penalized log-likelihood in Eq. (7), that is, The estimation was done via a coordinate descent algorithm using the glmnet package ( [5]) in the statistical package R. We chose the binomial family with logit link function in the settings of the glmnet function.Furthermore, the choice of λ was done by generating a sequence of 100 λ values and then choosing the best, using 10-fold cross-validation.All of those functions are provided by the package.
The glmnet function for the binomial model requires a two-level response.In our case we have, however, a constant response equal to one.To overcome this problem and allow the use the glmnet function, we added N rows of zeros to the data matrix X, and N new observations to the response vector Y , all with the value zero.We demonstrate below that this does not change the likelihood function, and since the response vector now contains both zeros and ones, the estimation of β can be done using the glmnet function, To see why the above claim is true, let j correspond to lines of the extended data matrix X which were added, while i corresponds to lines of the original data matrix.Then p j = 1/(1 + e 0 ) = 1/2, while p i = 1 1+e −X i β , as before.Hence the likelihood in Eq. ( 6), using the extended data matrix, becomes which equals the original likelihood in Eq. ( 6) multiplied by a constant independent of the parameters.Hence, the estimated coefficients are not affected by the modification.We chose N additional observations with zero response because, as we will discuss in Appendix A.4, we use bootstrapping to sample from the data matrix X.We found that N extra observations are enough to ensure both 0 and 1 responses in each bootstrap sample.If we alternatively added, say, only one extra observation, we would most probably get some bootstrap samples with only 1 as a response, and thus the glmnet function would not work.However, as already noted, the actual number of extra observations with zero response does generally not affect the estimated coefficients.

A.4 Bolasso
Lasso sometimes allows a few irrelevant variables to enter the model ( [3]).Further, according to Tibshirani [11], the standard errors for the lasso estimates are difficult to acquire.Those problems can be reduced by bootstrapping.Tibshirani [11], Avalos et al. [3] and Avalos et al. [2] suggested a bootstrap version of lasso which is called bolasso, which reduces the uncertainties of the lasso estimates.Bolasso runs a bootstrap on many samples, and estimates the parameters for each sample, using the lasso approach.
In our analysis we ran a bootstrap on the lasso model described in Section A.3, using 1000 bootstrap samples.For each bootstrap sample, a new λ sequence was generated by the glmnet package and the optimal λ was chosen using 10fold cross validation.Then the model was fitted by glmnet using the current bootstrap sample.
The β-coefficients frequently selected by the lasso from the bootstrap samples are taken into account, while the others are set to zero [3].Avalos et al. [3] suggested using Akaike's information criterion ( [1]), AIC = −2 ( β) + 2k, for estimating the frequency threshold of bolasso.Here ( β) is the log-likelihood function applied on the estimated coefficients from each model under selection, and k is the total number of non-zero elements in the β vector.AIC corresponds to an estimation of the allowed number of times that a coefficient could have been set to zero among the bootstraps.We can compute the AIC for different models and then choose the model which gives the lowest AIC-value.
The output from the bolasso in our analysis is a matrix of dimension p × B. Each column of the matrix corresponds to one bootstrap replicate, that is, the estimated vector of coefficients from that bootstrap sample.Each row of the matrix corresponds to the values that the specific coefficient took among the bootstrap samples.For each row in the matrix, we computed the total number of zeros and we created a frequency vector of length p.For each value in the frequency vector we created a model by checking which of the coefficients had been set to zero less times than the chosen threshold.For those, their mean among the bootstraps was taken as an estimate, that is where β * b j is the b-th bootstrap estimate of coefficient β j and B is the total number of bootstrap replicates ( [4]).The other coefficients are simply set to zero.Next, the AIC was computed for the corresponding model, where k is the number of non-zero estimates in the newly computed β vector.Finally, this is done for all thresholds in the sequence and the one which gives the lowest AIC is chosen as optimal.For that optimal threshold we again check which of the coefficients have been set to zero less times than the optimal threshold and we compute βj as before, while the others are set to zero.This vector of coefficients is the bolasso estimate of the coefficients for the objective function ( [3])., as well as 95% confidence intervals based on the central limit theorem ( βj − 1.96 Ŝ D j , βj + 1.96 Ŝ D j ) ( [4]).However, according to Lockhart et al. [7], con-fidence intervals obtained by the bootstrap estimates are not optimal for the lasso.

Finally, we canβj 2 /(B − 1 )
compute the standard error of each coefficient as Ŝ D j = B b=1 β * b j −